mtmd: add batching API by ngxson · Pull Request #24384 · ggml-org/llama.cpp

ngxson · 2026-06-09T23:06:24Z

Overview

Supersedes #24300

Also fixes #24380

Add a generic batching API to mtmd and wire it up to llama-server. The goal is to speed up llava-uhd-style models and, at the same time, improve video processing speed.

Current state:

llama-server can use it correctly
the mtmd API implementation is a mock-up; the proper logic still needs to be implemented

TODO:

add a notion of max batch size in mtmd
add a CLI argument for it
mtmd_batch_add_chunk should only accept inputs of the same size
wire up mtmd_batch_encode to use the 4th batch dim, added via mtmd: build_vit batching #24352
blacklist / whitelist models that can support it --> maybe only support build_vit() models for now
update mtmd-cli to use batching API --> skip, we don't actually need that

How it works

create a new mtmd_batch object
call mtmd_batch_add_chunk until it returns an error (either batch is full or current chunk can't be batched)
call mtmd_batch_encode on the batch
get the encoded embeddings via mtmd_batch_get_output_embd
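
The four steps above can be sketched as a toy model. This is not the real mtmd API: `ToyBatch`, its Python methods and the `max_tokens` budget are hypothetical stand-ins that only mirror the call order (`mtmd_batch_add_chunk` until it refuses, then `mtmd_batch_encode`, then `mtmd_batch_get_output_embd`). The C signatures are not shown in this thread, and the tile sizes and token counts are illustrative numbers.

```python
class ToyBatch:
    """Hypothetical stand-in for an mtmd_batch; not the real C API."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens  # assumed output-token budget per batch
        self.chunks = []              # list of (size, n_tokens)
        self.outputs = None

    def add_chunk(self, size, n_tokens):
        """Return True if the chunk was accepted, False if it cannot be batched."""
        if self.chunks:
            same_size = self.chunks[0][0] == size
            used = sum(n for _, n in self.chunks)
            if not same_size or used + n_tokens > self.max_tokens:
                return False
        # the first chunk is always accepted, even if it exceeds the budget
        self.chunks.append((size, n_tokens))
        return True

    def encode(self):
        # placeholder "embedding": one label per output token
        self.outputs = [[f"chunk{i}_tok{t}" for t in range(n)]
                        for i, (_, n) in enumerate(self.chunks)]

    def get_output_embd(self, i):
        return self.outputs[i]


def encode_all(chunks, max_tokens):
    """Drive the add -> encode -> get_output loop over all chunks."""
    results, pending = [], list(chunks)
    while pending:
        batch = ToyBatch(max_tokens)
        n_added = 0
        while n_added < len(pending) and batch.add_chunk(*pending[n_added]):
            n_added += 1
        batch.encode()
        results.extend(batch.get_output_embd(i) for i in range(n_added))
        pending = pending[n_added:]
    return results


# three small tiles (64 tokens each) followed by one larger overview image
chunks = [((448, 448), 64)] * 3 + [((1024, 1024), 256)]
embd = encode_all(chunks, max_tokens=128)
print([len(e) for e in embd])  # [64, 64, 64, 256]
```

With a budget of 128 tokens this runs three encoder passes: two tiles, then the third tile alone (the overview has a different size and is refused), then the overview by itself.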

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: no

sfallah · 2026-06-11T07:23:40Z

Hi @ngxson,

I just wanted to thank you for the time and patience you put into reviewing my PRs. I have learned a lot about llama.cpp in general, but especially mtmd, through that work. I would like to use that experience to help the team.

If you would trust me with it, I would be glad to help with refactoring like #24384, and with the follow-up of migrating the existing models to the new batching API. The migration part especially feels like a good fit for what I have learned.

No pressure either way — just tell me the shape you want and I will follow it.

Also, related to this: I did some profiling on whether batching gives a significant speed gain, and on the GPU memory overhead, testing on an M3 Max and a few small Nvidia GPUs. On small consumer-grade GPUs the speed gain was not large. Happy to share the numbers if useful.

ngxson · 2026-06-11T17:39:13Z

@sfallah yes, I'd appreciate it if you could adapt the deepseek-ocr model, similar to gemma4v.cpp in this PR

some important notes:

I'm testing with batches of up to 9 images in one encoder pass. on my macbook m5, I see almost no gain in performance
only images (or tiles) with the same size can be batched together; IIRC, ds-ocr v2 has a bigger overview image, but I could be wrong
the batch size is conditioned by the number of output tokens, so it's expected that some tiles cannot be processed in the same batch as the others. this is to prevent users from complaining that mmproj uses excessive memory, but we can gain some more space (and raise the batch size) if my other PR is accepted: mtmd, llama: shared backend sched #24361

sfallah · 2026-06-12T09:00:45Z

@ngxson

only images (or tiles) with the same size can be batched together; IIRC, ds-ocr v2 has a bigger overview image, but I could be wrong

yes, in fact both ds-ocr versions have bigger (1024x1024) overview images.

the batch size is conditioned by the number of output tokens, so it's expected that some tiles cannot be processed in the same batch as the others. this is to prevent users from complaining that mmproj uses excessive memory

ds-ocr v2 doesn't have any issue with this, like most llava uhd style (tile slicing) models I guess.
But ds-ocr v1 concats image_newline to every token row across the whole grid width (a row mixes tokens from all tiles in that grid row), i.e. we can at minimum encode a tile grid row at a time.

ngxson · 2026-06-12T09:17:40Z

ds-ocr v2 doesn't have any issue with this, like most llava uhd style (tile slicing) models I guess.
But ds-ocr v1 concats image_newline to every token row across the whole grid width (a row mixes tokens from all tiles in that grid row), i.e. we can at minimum encode a tile grid row at a time.

no, I think you misunderstood my point:

any llava-uhd style model slices the input image into multiple smaller (always square) images. for example, one big image can be sliced into 9 tiles.

without batching, the cgraph only needs enough memory to hold 1 image in a single decode pass. a batch of 9 images means you now need 9x the memory, which can be too large, so the batch is conditioned by the number of output tokens; it's not the most reliable measure, but roughly correct. for example, if the batch is limited to max 6 tiles, the image will be processed in 2 batches: 6 + 3

the main point is that this limit is expected and should always be respected by all models
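
The 6 + 3 split can be reproduced with a small sketch. The function name and the token numbers (256 output tokens per tile, a budget of 1536) are hypothetical; the only behaviour taken from the thread is that the limit is expressed in output tokens and that a batch always accepts at least one tile.

```python
def split_into_batches(n_tiles, tokens_per_tile, max_tokens):
    """Greedy split of equally sized tiles under an output-token budget."""
    # at least one tile per batch, even if a single tile exceeds the budget
    per_batch = max(1, max_tokens // tokens_per_tile)
    sizes = []
    remaining = n_tiles
    while remaining > 0:
        take = min(per_batch, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(split_into_batches(9, 256, 1536))  # [6, 3]
```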

sfallah · 2026-06-12T10:10:12Z

@ngxson
no misunderstanding.
I think I already knew what you meant.
I tried to explain that we have this extra constraint with ds-ocr v1: we can't encode tile-wise, we can at minimum encode a tile-grid row at a time.
So if we have, say, grid_x=2, grid_y=4 (a 2x4 grid; max-tiles is 9), the min batch size is 2 tiles (one grid row).
I.e. the n_tokens of one grid row (the two tiles plus the woven newlines) should fit under batch_max_tokens. And if the limit is smaller than even one row, the row gets encoded anyway - same soft behavior as your "first image will always be added" rule.
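
A rough sketch of this row-aligned planning, under stated assumptions: each tile yields `tile_w * tile_w` output tokens, every full-width token row gets one newline, and `row_tokens` / `plan_row_batches` are hypothetical helper names. The numbers (`tile_w=10`, budgets of 500 and 100) are made up for illustration.

```python
def row_tokens(grid_x, tile_w):
    """Tokens of one tile-grid row: grid_x tiles plus one newline per token row."""
    return grid_x * tile_w * tile_w + tile_w


def plan_row_batches(grid_x, grid_y, tile_w, batch_max_tokens):
    """Return the number of tiles per batch, keeping batches row-aligned."""
    # soft limit: one grid row is always encoded, even if it exceeds the budget
    rows_per_batch = max(1, batch_max_tokens // row_tokens(grid_x, tile_w))
    plan = []
    y = 0
    while y < grid_y:
        n_rows = min(rows_per_batch, grid_y - y)
        plan.append(n_rows * grid_x)
        y += n_rows
    return plan


print(row_tokens(2, 10))                 # 210
print(plan_row_batches(2, 4, 10, 500))   # [4, 4]
print(plan_row_batches(2, 4, 10, 100))   # [2, 2, 2, 2]
```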

I have it almost ready, I will create a DRAFT PR so you can see it in the code.

ngxson · 2026-06-12T11:20:47Z

I tried to explain that we have this extra constraint with ds-ocr v1: we can't encode tile-wise, we can at minimum encode a tile-grid row at a time.

I don't see why we can't. you are assuming that all images in the batch must have the same number of output tokens, but that is not the case.

the batching system is flexible: the number of output tokens can be different for each image in the batch. that means even if one image in the batch has a newline and the rest don't, there is no problem at all.

assuming that a whole row needs to be encoded makes the logic model-specific. there are always cases where you can absolutely fit multiple rows in the same batch (i.e. the user simply allows a larger batch)

all you need to do is to insert the newline at the correct index in the output. it can be done simply with a loop that views the 3d output [n_embd, n_tokens, n_batch] as slices of [n_embd, n_tokens], then concats them back while inserting a newline conditionally

and that even works if the batch is not row-aligned, for example the output can be: [tile, tile, newline, tile, tile]
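
The assembly described in this comment can be sketched as follows. It models the layout as the comment states it (tiles stay contiguous, a newline follows the last tile of a grid row); a later reply in the thread argues that the DeepSeek-OCR v1 reference layout differs. `assemble`, `first_tile_index` and the `"<nl>"` marker are hypothetical names, and each tile output is a plain list standing in for one `[n_embd, n_tokens]` slice.

```python
def assemble(tile_outputs, first_tile_index, grid_x, newline="<nl>"):
    """Concatenate per-tile outputs, adding a newline after each grid-row end."""
    out = []
    for i, tile in enumerate(tile_outputs):
        out.extend(tile)
        # global tile index decides whether this tile closes a grid row
        if (first_tile_index + i + 1) % grid_x == 0:
            out.append(newline)
    return out


# a batch that starts mid-row: tiles 1..4 of a grid with 3 tiles per row
batch = [["t1"], ["t2"], ["t3"], ["t4"]]
print(assemble(batch, first_tile_index=1, grid_x=3))
# ['t1', 't2', '<nl>', 't3', 't4']
```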

sfallah · 2026-06-12T16:53:51Z

all you need to do is to insert the newline at the correct index in the output. it can be done simply with a loop that views the 3d output [n_embd, n_tokens, n_batch] as slices of [n_embd, n_tokens], then concats them back while inserting a newline conditionally

That is how my first implementation of ds-ocr dynamic resolution worked, see encode_deepseekocr_v1:
https://github.com/sfallah/llama.cpp/blob/sf/deepseek-ocr-mul-tile-dyn-res/tools/mtmd/mtmd.cpp#L136-L178
Tiles are encoded independently, newlines are inserted afterwards (host-side there). I linked this branch in the description of #24300 and explained why I moved away from it:

I have prepared a sequential (non-batched) alternative (see sf/deepseek-ocr-mul-tile-dyn-res) that I consider to be a hack. That is why I followed this path more seriously.

To be precise about the layout: a tile's tokens are not contiguous in the final output, so the loop has to interleave tile rows, not concat whole tiles.

And that was exactly my problem: it is ugly, model-specific and lives in mtmd. "Insert the newline to the correct index" is the model-specific part; the loop that knows the indices has to live somewhere. So my question to you: where would you put this weaving/assembly so that it stays clean and model-agnostic?

I have both variants working now (in-graph weave with row-aligned batches, flat tile batches with assembly-time weave); identical OCR output either way, including non-row-aligned splits. Draft PR coming so you can see it in code.

ngxson · 2026-06-12T17:10:10Z

please correct me if I'm wrong, but let's take the non-batching version as the ground truth:

for ds-ocr-v1:

llama.cpp/tools/mtmd/models/deepseekocr.cpp

Line 315 in ebc1077

    cur = ggml_concat(ctx0, cur, model.view_seperator, 1);  // (n_dim, h*(w+1) + 1)

the view_seperator is always appended unconditionally to the tile, so I imagine the output will be: [tile, view_seperator, tile, view_seperator, tile, view_seperator, ...]

on the batched version, you can do that by simply concatenating the view_seperator along the 2nd dim; it will be broadcast to the 3rd dim (batch dim). so are there any other problems with it? it's just a simple ggml_concat()

for v2:

llama.cpp/tools/mtmd/models/deepseekocr2.cpp

Lines 72 to 75 in ebc1077

    // view_seperator only after the global view
    if (img.add_viewsep) {
        cur = ggml_concat(ctx0, cur, model.view_seperator, 1); // (n_dim, 257)
    }

the view_sep is only added for the overview image, which won't be batched anyway (because it's bigger than the tiles), so I think we don't even need to do a loop to assemble it. upon encoding the overview image, we can simply add the view_seperator to all images in the batch (because other images should also be the overview, they are all the same size)

To be precise about the layout: a tile's tokens are not contiguous in the final output, so the loop has to interleave tile rows, not concat whole tiles.

why aren't they contiguous? IIUC the output is [n_embd, n_tokens_per_image, n_batch], so they should follow the same order as the input

sfallah · 2026-06-12T17:38:04Z

I think we are mixing two different things here:

Batching multiple complete input images. Every encoded input image produces an output block of the same shape (for single-view v1 always: global view rows + one newline per row + one view_seperator at the end), so any number of them can be stacked -- view_seperator included, one per image. No problem there, and not what I am talking about.
Encoding the tiles of ONE multi-tile input image. Here the tiles share one image_newline weave that spans across the tiles. This is the only hard part, and it is about image_newline, not view_seperator.

why aren't they contiguous? IIUC the output is [n_embd, n_tokens_per_image, n_batch], so they should follow the same order as the input

Because the HF reference rearranges the tile features into the full image grid before inserting the newlines, see modeling_deepseekocr.py:
https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/modeling_deepseekocr.py

local_features = local_features.view(height_crop_num, width_crop_num, h2, w2, n_dim2).permute(0, 2, 1, 3, 4).reshape(height_crop_num*h2, width_crop_num*w2, n_dim2)
local_features = torch.cat(
    [local_features, self.image_newline[None, None, :].expand(height_crop_num * h2, 1, n_dim2)], dim=1
)
...
global_local_features = torch.cat([local_features, global_features, self.view_seperator[None, :]], dim=0)

The permute(0, 2, 1, 3, 4) swaps the tile-column axis with the row-within-tile axis: the result is the stitched image as one grid of height_crop_num*h2 rows, and image_newline goes after each of these full-width rows. So a tile's tokens are spread over h2 different rows of the final output. And view_seperator appears exactly once per input image, at the very end after the global view (last line above).
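
To make the interleaving concrete, here is a pure-Python sketch of the same rearrangement on nested lists instead of tensors. `weave` and the `"<nl>"` marker are hypothetical names, each token is a string label rather than an embedding vector, and the global view and view_seperator are left out.

```python
def weave(tiles, grid_y, grid_x, h2, newline="<nl>"):
    """Stitch tiles into full-width rows and append one newline per row.

    tiles[ty * grid_x + tx][r] is token row r of the tile at grid
    position (ty, tx); each token row is a list of w2 tokens.
    """
    out = []
    for ty in range(grid_y):          # tile-grid row
        for r in range(h2):           # token row inside the tiles
            for tx in range(grid_x):  # walk across all tiles of this grid row
                out.extend(tiles[ty * grid_x + tx][r])
            out.append(newline)
    return out


# a 1x2 grid of tiles A and B, each with 2x2 tokens
A = [["A00", "A01"], ["A10", "A11"]]
B = [["B00", "B01"], ["B10", "B11"]]
print(weave([A, B], grid_y=1, grid_x=2, h2=2))
# ['A00', 'A01', 'B00', 'B01', '<nl>', 'A10', 'A11', 'B10', 'B11', '<nl>']
```

In the printed result the tokens of tile A are split by tokens of tile B, which is the non-contiguity described above.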

I am doing the same in ggml in my batched-encode branch (the #24300 one), see:
https://github.com/sfallah/llama.cpp/blob/sf/dsocr-mul-tile-batched-encode/tools/mtmd/models/deepseekocr.cpp#L315-L324

cur = ggml_reshape_4d(ctx0, cur, n_dim * tile_w, tile_w, grid_x, grid_y); // [n_dim*tile_w, tile_w, grid_x, grid_y]
cur = ggml_cont(ctx0, ggml_permute(ctx0, cur, 0, 2, 1, 3));
...
nl  = ggml_repeat_4d(ctx0, model.image_newline, n_dim, 1, gh, 1);
cur = ggml_reshape_3d(ctx0, cur, n_dim, gw, gh); //[n_dim, gw, gh]
cur = ggml_concat(ctx0, cur, nl, 1);

Also, master cannot be the ground truth for multi-tile: master's ds-ocr v1 is single-view only, the multi-tile path is what #24300 added. The unconditional view_seperator at the line you linked is correct there because the only image master ever encodes is the global view -- that is exactly why my branch gates it on add_viewsep.

For v2 I agree on the layout: tiles have no newline weave, their raw concatenation is already the final output, and view_seperator only follows the overview -- that is what my implementation does.

My implementation of exactly this layout scores CER 0.0000 against the HF output on the multi-tile eval -- with a tile-contiguous layout that match would not be possible.

ngxson · 2026-06-12T17:47:16Z

Also, master cannot be the ground truth for multi-tile: master's ds-ocr v1 is single-view only, the multi-tile path is what #24300 added.

tbh I'm not a fan of introducing 2 changes in one PR. this is the exact root cause of the miscommunication in the past N messages between us. if you have 2 different changes, please make it very clear.

what I understand is that there are 2 different subjects:

for v1, you want to add multi-tile path --> it can still work without batching (independently), correct?
for v2 (and v1-multi-tile), you want multiple tiles to be processed in one batch, correct?

for v1-multi-tile, please push a PR without batching support first. I will not proceed until I understand what it does.

for v2-batching, it should be the same case as the existing llava-uhd models --> no significant problem, right?

sfallah · 2026-06-12T17:55:58Z

what I understand is that there are 2 different subjects:

for v1, you want to add multi-tile path --> it can still work without batching (independently), correct?

for v2 (and v1-multi-tile), you want multiple tiles to be processed in one batch, correct?

Correct on both.

v1 multi-tile works without your batching API. I will push it as a standalone PR against master so it can be reviewed on its own. For the weave I will use your method 1, since you prefer it -- I have that variant implemented and validated.
Batching the tiles is then a thin layer on top of the standalone PR, and yes, for v2 it is the same case as llava-uhd, no significant problem.

Agreed on splitting the PRs.

sfallah · 2026-06-12T18:14:05Z

@ngxson

BTW the PR that you closed rather abruptly (#24300) already included functionally everything for a proper DSOCR v1+v2 dynamic-resolution multi-tile batched encoding of tiles in parity with HF reference impls -- carefully crafted, with a solid regression test, perf testing and profiling.

The non-batched PR you are asking for is essentially my dyn-res branch (https://github.com/sfallah/llama.cpp/tree/sf/deepseek-ocr-mul-tile-dyn-res); the batched layer on top is exactly what #24300 did.

But I understand your point about splitting the PRs, so I will do that. I just want to make sure we are on the same page about the content of each PR and the implications of the different approaches. Concretely: method 1 (the in-graph weave) needs the whole grid in one graph, so the non-batched first PR will follow the dyn-res approach; the in-graph weave then belongs to the batched layer.

ngxson · 2026-06-12T18:51:14Z

The non-batched PR you are asking for is essentially my dyn-res branch (https://github.com/sfallah/llama.cpp/tree/sf/deepseek-ocr-mul-tile-dyn-res); the batched layer on top is exactly what #24300 did.

I might be a bit harsh here, but the value of an open source contribution is not only that "the code works", but also planning and communication. you cannot expect to push a large PR that contains multiple (unrelated) changes and have someone else fully understand it; that's not how code review works.

your own comment #24300 (comment) also pointed out independent changes that can be split into smaller PRs, why don't we do that instead? not necessarily 4 separate PRs, but you get the idea. I did acknowledge the first one, #24352, and as proof: the review for that PR was straightforward.

Concretely: method 1 (the in-graph weave) needs the whole grid in one graph, so the non-batched first PR will follow the dyn-res approach; the in-graph weave then belongs to the batched layer.

I don't quite understand your intent here, but to make it clear: I expect the first version that simply doesn't use the 4th dim (n_batch); that dim should always be 1 in the cgraph

I imagine such change will affect just 2 places:

preprocessor of ds-ocr-v1
cgraph of ds-ocr-v1, to conditionally add view_seperator, similar to the if (img.add_viewsep) on v2

ngxson · 2026-06-12T18:55:22Z

merging this change after CI passes; tests.sh also passes:

[vision] OK:   ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[vision] OK:   ggml-org/LFM2-VL-450M-GGUF:Q8_0
[vision] OK:   ggml-org/granite-docling-258M-GGUF:Q8_0
[vision] OK:   ggml-org/LightOnOCR-1B-1025-GGUF:Q8_0
[vision] OK:   ggml-org/DeepSeek-OCR-GGUF:Q8_0
[vision] OK:   ggml-org/dots.ocr-GGUF:Q8_0
[vision] OK:   ggml-org/HunyuanOCR-GGUF:Q8_0
[vision] OK:   ggml-org/gemma-4-E2B-it-GGUF:Q8_0
[audio]  OK:   ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M
[audio]  OK:   ggml-org/LFM2-Audio-1.5B-GGUF:Q8_0
[audio]  OK:   ggml-org/gemma-4-E2B-it-GGUF:Q8_0
[audio]  OK:   ggml-org/Qwen3-ASR-0.6B-GGUF:Q8_0

mtmd: add batching API

b62c305

github-actions Bot added examples server labels Jun 9, 2026

ngxson mentioned this pull request Jun 9, 2026

mtmd: DeepSeek-OCR multi-tile dynamic resolution batched encoding #24300

Closed

ngxson added 11 commits June 11, 2026 13:24

wip

111d3f1

Merge branch 'master' into xsn/mtmd_batch_api

eb2dab2

first working version (gemma4v)

f77cfd7

add arg

190bef3

nits

a773d7b

wire up support_batch()

3eecd67

fix 0.0 output embd

7a22484

fix audio

2dd581a

nits

de656cc

refactor a bit

67d4335

nits

0d6bc77

ngxson marked this pull request as ready for review June 11, 2026 17:33

ngxson requested review from a team as code owners June 11, 2026 17:33

ngxson mentioned this pull request Jun 12, 2026

server : unify mtmd image processing with post-decode callback #24520

Draft

1 task

fix non-batching case

b3a5ca9

fix comment

4cf7759

ngxson merged commit e37abd6 into ggml-org:master Jun 12, 2026
25 checks passed
